
release: benchflow 0.3.2 — BaseUser, verifier hardening, DinD compose, lint cleanup#199

Merged
xdotli merged 28 commits into main from dev-0.3 on Apr 25, 2026

Conversation

@xdotli
Member

@xdotli xdotli commented Apr 25, 2026

Release: v0.3.2

Cuts dev-0.3 → main as the v0.3.2 release. The v0.3.2 tag will be created on main after this merges; CI then publishes to PyPI.

What's in 0.3.2

Features

Fixes

Chores

Validation

  • All 7 release-critical PRs merged into dev-0.3 in sequence
  • ruff check . clean
  • Test suite passing modulo 8 pre-existing failures (env-pollution between subscription auth tests, Docker compose env, judge_model default mismatch — none caused by this release)
  • SWE-bench Pro oracle: 5/5 on Daytona (ansible, flipt, openlibrary, navidrome, qutebrowser)
  • Single-round Gemini 3.1 Pro baseline: 2/4

Post-merge actions

  1. Tag v0.3.2 on main: git tag -a v0.3.2 -m "benchflow 0.3.2"; git push origin v0.3.2
  2. gh release create v0.3.2 --generate-notes (CI publishes to PyPI)
  3. Bump main pyproject.toml version to 0.3.3.dev0
  4. Delete dev-0.3 branch (going forward: trunk-based, PRs target main)

Test plan

  • CI runs against the merge commit
  • Devin reviews
  • Tag and publish after merge
  • pip install benchflow==0.3.2 works after CI publishes


xdotli and others added 28 commits April 21, 2026 13:39
…169)

* fix: skip model/API-key validation for oracle agent

The oracle agent runs solution/solve.sh and never calls an LLM, but
resolve_agent_env() was validating API keys for whatever model the CLI
defaulted to (claude-haiku-4-5-20251001). This made `bench eval create
-a oracle` fail without ANTHROPIC_API_KEY set, even though oracle
doesn't need it.

* fix: don't assign default model to oracle agent

Move the fix from resolve_agent_env to the CLI layer: oracle runs
solve.sh and never calls an LLM, so it should not receive DEFAULT_MODEL
at all. Both _run_single and _run_batch now pass model=None for oracle.
Widen JobConfig.model to str | None to support this.

* fix: openhands install — use uv tool install or pip install openhands-ai

The PyPI package 'openhands' (0.0.0) is a placeholder, not the CLI.
The real install is 'uv tool install openhands' (preferred) or
'pip install openhands-ai'. Tries uv first, falls back to pip.

Fixes #169 runtime error: 'openhands: command not found'

---------

Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
Five fixes for issue #169 (openhands: command not found):

1. PATH: add $HOME/.local/bin to launch_cmd so uv-installed binary is found
2. Interpreter access: chmod o+x on /root path chain so sandbox user can
   reach the uv-managed Python shebang at /root/.local/share/uv/tools/
3. ACP auth: seed ~/.openhands/agent_settings.json at install (OpenHands
   _is_authenticated() requires it) and overwrite with real LLM_MODEL/KEY
   at launch (workaround for OpenHands ACP not applying --override-with-envs
   in _create_conversation)
4. Model env: add BENCHFLOW_PROVIDER_MODEL → LLM_MODEL to env_mapping
5. CWD: remove hardcoded cd /home/{user} from build_priv_drop_cmd — it
   overrode the docker -w /app workspace, causing agents to write files
   in the wrong directory

Also adds home_dirs=[".openhands"] so setup_sandbox_user copies the
settings dir to the agent user.

Tested: bench eval create + bench run, both sandbox_user=agent and root,
gemini agent regression-verified, 45/45 registry+sandbox tests pass.
…enes

Multi-role scenes (coder + reviewer) now communicate via outbox files
through the main bf.run(TrialConfig) path. Previously, outbox-based
message passing only worked through the standalone _scene.py scheduler
(used by followup-bench). Now the same convention works end-to-end:

  1. Scheduler sets up /app/.outbox/ before the first turn
  2. After each turn, reads outbox files written by the active role
  3. Injects received messages into the next role's prompt

Also includes:
- Coder-reviewer demo script (docs/notebooks/coder-reviewer-demo.py)
- Real runnable notebook replacing config-only cells with bf.run() calls
- Multi-turn vs multi-round terminology in README and api-reference
- 7 new tests covering outbox setup, injection, cleanup, and edge cases
1. Quote file paths with shlex.quote() in _read_scene_outbox() to
   prevent shell command injection via crafted outbox filenames
2. chown /app/.outbox to sandbox_user so agents can actually write
   outbox files (was root:root 755 → agent couldn't write)
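The quoting and injection steps can be sketched like this. A simplified illustration under assumptions: the helper names and the `[from role]` prompt framing are invented here, and the real scheduler reads files through a sandbox shell rather than the local filesystem.

```python
import shlex
from pathlib import Path

def outbox_read_cmd(path: Path) -> str:
    # Shell command to read one outbox file inside the sandbox.
    # shlex.quote() blocks command injection via crafted filenames
    # such as "coder; rm -rf /.json".
    return f"cat {shlex.quote(str(path))}"

def inject_messages(prompt: str, messages: dict[str, str]) -> str:
    # Prepend messages received from other roles to the next role's prompt.
    if not messages:
        return prompt
    header = "\n".join(f"[from {role}] {text}" for role, text in messages.items())
    return f"{header}\n\n{prompt}"
```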
…st gaps

1. Persist inter-role messages to trial_dir/scene_messages.jsonl
   (was ephemeral — injected into prompts then discarded)
2. Install non-primary agents in connect_as() for heterogeneous scenes
   (was broken: only primary agent was installed)
3. Honest Harbor mapping — document what 0.3 delivers vs what's a gap:
   - Shipped: roles, turns, outbox messaging, message persistence
   - Gap: dynamic termination, oracle access, per-round verification,
     inter-round trajectory inspection
4. Add 0.3 Limitations section to api-reference
5. Two new tests: message persistence + heterogeneous agent install
All 3 patterns executed end-to-end on regex-log task via Daytona:
- Baseline: reward=1.0, 3 tool calls
- Self-review (multi-turn): reward=1.0, 7 tool calls
- Coder-reviewer (multi-round): reward=0.0, 13 tool calls

Outbox messaging confirmed working: reviewer wrote feedback to
/app/.outbox/coder.json, scheduler read and injected into coder's
prompt. Messages persisted to scene_messages.jsonl.
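The persistence step (item 1 above) amounts to appending one JSON record per message. A hedged sketch: the function name and record fields are assumptions, not the actual `Trial` code; only the `scene_messages.jsonl` filename comes from the commit message.

```python
import json
from pathlib import Path

def persist_scene_message(trial_dir: Path, round_idx: int, role: str, message: str) -> None:
    # Append one inter-role message to trial_dir/scene_messages.jsonl so it
    # survives the run instead of living only inside injected prompts.
    record = {"round": round_idx, "from": role, "message": message}
    with (trial_dir / "scene_messages.jsonl").open("a") as fh:
        fh.write(json.dumps(record) + "\n")
```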
…primary agents

1. connect_as() now writes credential files and uploads subscription
   auth for non-primary agents, matching what install_agent() does
   for the primary agent. Fixes heterogeneous scenes where e.g.
   codex-acp needs ~/.codex/auth.json.

2. connect_as() now updates self._agent_launch so disconnect()'s
   pkill fallback targets the correct process (not always the
   primary agent's binary).

3. Note: the openhands launch_cmd pkill issue (pkill -f 'export')
   is pre-existing in registry.py, not introduced by this PR.
Tasks requesting more storage than the Daytona tier allows fail at
sandbox creation. Apply the same clamping pattern already used for
cpus and memory_mb so tasks degrade gracefully. The cap is overridable
via BENCHFLOW_DAYTONA_MAX_STORAGE_MB.
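The clamping pattern described above is roughly the following. The default cap of 10240 MB is a placeholder assumption; only the `BENCHFLOW_DAYTONA_MAX_STORAGE_MB` override comes from the commit message.

```python
import os

DEFAULT_MAX_STORAGE_MB = 10_240  # hypothetical default; real cap depends on the Daytona tier

def clamp_storage_mb(requested: int) -> int:
    # Clamp a task's storage request to the tier cap, mirroring the
    # pattern already used for cpus and memory_mb, so oversized tasks
    # degrade gracefully instead of failing at sandbox creation.
    cap = int(os.environ.get("BENCHFLOW_DAYTONA_MAX_STORAGE_MB", DEFAULT_MAX_STORAGE_MB))
    return min(requested, cap)
```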

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: clamp Daytona storage_mb to configurable max
feat: wire outbox messaging into Trial._run_scene()
* Fix DinD compose exec missing project/directory/file flags

DaytonaProcess.start() hardcoded `docker compose exec` without the
`-p`, `--project-directory`, and `-f` flags needed to locate the
running compose project inside the DinD sandbox. This caused exec
to fail silently with "Process closed stdout (rc=None)".

Extract the full compose base command from Harbor's strategy via
`_compose_cmd([])` during `from_harbor_env()` and use it in `start()`
so the exec subcommand includes all required project identifiers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use shlex.join for DinD compose exec to handle paths with spaces

Address Devin review feedback — shlex.split() + " ".join() loses quoting
for tokens with spaces. Use shlex.join() which properly quotes each token.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
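The shape of the fix can be sketched as below: reuse the compose base command (which already carries `-p`, `--project-directory`, and `-f`) and join with `shlex.join()` so tokens with spaces stay quoted. The function name and parameters are illustrative, not the actual `DaytonaProcess` API.

```python
import shlex

def compose_exec_cmd(compose_base: list[str], service: str, inner: list[str]) -> str:
    # compose_base is the full base command extracted from the strategy,
    # e.g. ["docker", "compose", "-p", "proj", "--project-directory", ..., "-f", ...],
    # so the exec subcommand can locate the running project in the DinD sandbox.
    tokens = [*compose_base, "exec", "-i", "-T", service, *inner]
    # shlex.join quotes each token individually, so shlex.split()-then-" ".join()
    # style loss of quoting around paths with spaces cannot happen.
    return shlex.join(tokens)
```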
- fix: DinD compose exec now includes project/directory/file flags (#188)
- fix: clamp Daytona storage_mb to configurable max (#185)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…193)

SSH pipes break through the DinD→compose exec chain, causing
"Process closed stdout (rc=None)" on all compose tasks.

New DaytonaPtyProcess uses Daytona SDK's WebSocket PTY API for the
outer connection (keeps pipe alive), then docker compose exec -i -T
inside (clean stdio for the agent). Includes marker-based startup
to drain shell output before ACP handshake, and echo-resistant
response matching in the ACP client (filter echoed requests by
checking for 'method' field absence).

Also adds skills_dir: "auto" support in Job for per-task skill
resolution after PR #720 removed COPY skills from Dockerfiles.
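The echo-resistant matching described above boils down to one observation: an echoed JSON-RPC request still carries a `method` field, while a response does not. A minimal sketch (the function name is invented; the real client does this inside its read loop):

```python
import json

def filter_responses(lines: list[str]) -> list[dict]:
    # Keep only JSON-RPC responses from a PTY stream: the PTY echoes our
    # own requests back, and those echoes have a "method" field while
    # responses carry "result"/"error" instead.
    out = []
    for line in lines:
        try:
            msg = json.loads(line)
        except json.JSONDecodeError:
            continue  # shell noise before/around the ACP stream
        if isinstance(msg, dict) and "method" not in msg:
            out.append(msg)
    return out
```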

* fix: oracle agent — chokepoint guard, drop orphan eval CLI, helper

PR #173 moved the oracle/DEFAULT_MODEL guard from resolve_agent_env to
cli/eval.py, but cli/eval.py is orphaned (never imported into the live
CLI), so `bench eval create` still passes DEFAULT_MODEL to oracle and
trips ANTHROPIC_API_KEY validation. Three changes:

- Restore the `agent != "oracle"` guard in resolve_agent_env so the
  chokepoint defends against any caller that forwards a model.
- Delete the orphan cli/eval.py and its tests — the live eval_create
  lives in cli/main.py and was the actual code path users hit.
- Add effective_model(agent, model) helper, change JobConfig.model
  default to None, replace seven `model or DEFAULT_MODEL` sites in
  cli/main.py and job.py YAML loaders so oracle gets honest model=None
  end-to-end (in result/summary JSON, prints, and downstream Trial).

Regression test in test_resolve_env_helpers.py pins the chokepoint by
calling resolve_agent_env("oracle", DEFAULT_MODEL, {}) with no API key
and no host auth — verified to fail on main with the user-facing
ANTHROPIC_API_KEY error and pass after the fix.

* test: regression suite pinning oracle chokepoint + orphan removal

Bundle 14 tests in tests/test_oracle_chokepoint.py that pin each layer
of the prior fix at the right altitude:

- TestOrphanRemoval — cli/eval.py is gone (ModuleNotFoundError) and no
  src/ file references benchflow.cli.eval, guarding against a future
  re-introduction that could swallow the next bug fix the same way.
- TestEvalCreateRouting — `bench eval create` callback lives in
  cli/main.py:eval_create. Pins the architectural fact PR #173 missed.
- TestEffectiveModel — unit tests for the helper: oracle drops model,
  non-oracle falls back to DEFAULT_MODEL, empty string treated as unset.
- TestOracleYamlLoaders — Job.from_yaml(oracle config) → model is None
  for both native and Harbor formats; non-oracle backwards-compat
  preserved.
- TestEvalCreateOracleCLI — end-to-end: live eval_create(agent="oracle")
  with no API key in env does not raise. Mocks Trial.create and resets
  the asyncio loop after to avoid polluting pre-existing tests that use
  the deprecated asyncio.get_event_loop() pattern.

Verified to fail on main in the right shape: 9 of 14 fail (each pinning
a deleted/added behavior), 5 pass (asserting structural facts already
true). The CLI test fails on main with the user-reported error
"ANTHROPIC_API_KEY required for model 'claude-haiku-4-5-20251001'…".

* fix: restore cli/eval.py and test_eval_cli.py, apply oracle guard

The previous commit deleted cli/eval.py and its tests as orphans, but
they are intentionally kept. Restore both from main, update eval.py to
use the effective_model() helper for the oracle chokepoint fix, and
replace the "module is gone" regression test with a guard that cli/main.py
does not import cli/eval (the actual invariant).

* docs: clarify cli/eval.py and test_eval_cli.py are not wired into live CLI

---------

Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
Brings 126 ruff errors → 0 so CI's lint check goes green and unblocks
the 5 PRs targeting dev-0.3 (#176, #180, #181, #182, #191) that were
landing on top of pre-existing repo lint debt.

What changed:
1. Auto-fixes via `ruff check --fix --unsafe-fixes`:
   - 40 F401 unused-imports across src/, tests/, examples/
   - 8 I001 unsorted-imports
   - 6 UP037 quoted-annotations modernized
   - Other auto-fixable rules

2. Hand fixes:
   - src/benchflow/__init__.py: removed `Trial` from the `from harbor`
     re-export block (it was shadowed by `from benchflow.trial import Trial`
     at line 65, which is the canonical public Trial). Added
     `trial_config_from_yaml` to __all__.
   - src/benchflow/process.py: 3x `raise ConnectionError(...) from e` for
     B904 (errors raised inside except clauses).
   - src/benchflow/mcp/reviewer_server.py: same B904 fix for fastmcp
     ImportError reraise.
   - tests/test_skill_eval.py: raw string for `pytest.raises(match=...)`
     pattern (RUF043).
   - 3 files: replaced `×` (Unicode multiplication sign) in comments and
     f-strings with `x` (latin x) to clear RUF001/RUF003.

3. Per-file ignores added to pyproject.toml `[tool.ruff.lint.per-file-ignores]`:
   - `experiments/*.py` and `tests/conformance/*.py` ignore E402 — these
     are standalone scripts that legitimately set sys.path before importing.
   - `src/benchflow/runtime.py` ignores F821 — uses forward references
     resolved by `from __future__ import annotations`; explicit
     TYPE_CHECKING imports would force eager loads.

No code behavior changes. 580 tests pass; the 8 pre-existing failures
(env-leak between subscription auth tests, Docker compose env, judge
model default mismatch) are unrelated to this PR.
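The B904 hand-fix pattern mentioned above (for process.py and reviewer_server.py) looks like this in miniature. The function and payload are hypothetical; only the `raise ... from e` idiom is the point.

```python
import json

def load_config(raw: str) -> dict:
    try:
        return json.loads(raw)
    except json.JSONDecodeError as e:
        # "from e" chains the original exception as __cause__, preserving
        # its traceback; raising inside an except block without it is what
        # ruff's B904 rule flags.
        raise ConnectionError(f"bad payload: {raw!r}") from e
```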

* docs: add fix plan for connect_as() agent_env bug (#2)

* docs: expand fix plan with eng review findings and test cases

Add two edge-case test requirements (non-overlapping key merge,
None safety) from /plan-eng-review. Append review report confirming
0 issues, 0 critical gaps — ready to implement.

* fix: merge cfg.agent_env into connect_as() env resolution (#2)

connect_as() passed only role.env to resolve_agent_env, losing all
config-level env vars (e.g. BENCHFLOW_PROVIDER_BASE_URL from YAML).
Merge cfg.agent_env as base with role.env overlay so role-specific
vars win on overlap.

* remove plan

---------

Co-authored-by: Xiangyi Li <xiangyi@benchmarkthing.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* rebase on upstream/0.3

* openhand cli add

* enhance api key security

* refine tests

Co-authored-by: Copilot <copilot@github.com>

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* docs: use `uv tool install` instead of `pip install`

benchflow is a CLI tool with entry points — uv tool install gives users
an isolated environment (like pipx) without managing venvs manually.

---------

Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>

* test: cover sandbox setup timeout wiring

* docs: document sandbox setup timeout

* feat: wire sandbox setup timeout through configs

`setup_sandbox_user()` already accepted a `timeout_sec` kwarg (default
120s) but no live call site surfaced it — the knob was unreachable for
normal runs. Under heavy sandbox bootstrap (parallel containers copying
large tool caches into /home/<sandbox_user>) the 120s cap was hit with
no user override.

Add `sandbox_setup_timeout: int = 120` to TrialConfig, JobConfig, and
RuntimeConfig, and forward it through:
- trial YAML (`trial_config_from_dict`)
- job YAML (both native and Harbor-compatible loaders)
- `SDK.run(..., sandbox_setup_timeout=...)`
- `bench eval create --sandbox-setup-timeout`
- `Trial.install_agent()` into both `setup_sandbox_user()` call sites
  (oracle + normal agent)

The value is also recorded in the run's `config.json` snapshot to aid
post-hoc diagnosis. Default stays at 120s — this change is about making
the value configurable, not changing runtime behavior.
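The wiring reduces to forwarding one field into an existing kwarg. A trimmed-down sketch: the real `TrialConfig`, `setup_sandbox_user()`, and `install_agent()` have far more fields and logic than shown here.

```python
from dataclasses import dataclass

@dataclass
class TrialConfig:
    # Hypothetical trimmed-down config; the new knob defaults to the
    # existing 120s so runtime behavior is unchanged unless overridden.
    agent: str = "oracle"
    sandbox_setup_timeout: int = 120  # seconds

def setup_sandbox_user(user: str, timeout_sec: int = 120) -> str:
    # Stand-in for the real helper, which already accepted timeout_sec
    # but had no live call site surfacing it.
    return f"setup {user} (timeout={timeout_sec}s)"

def install_agent(cfg: TrialConfig) -> str:
    # The fix is essentially this one line at each call site:
    # forward the config value into the existing kwarg.
    return setup_sandbox_user("agent", timeout_sec=cfg.sandbox_setup_timeout)
```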

---------

Co-authored-by: Xiangyi Li <xiangyi@benchmarkthing.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>

* docs(plan): add plan to fix sandbox io problem

* test: lock sandbox setup contract

Plan step 1/6: Lock the new sandbox contract in tests

* fix: stop copying root tool installs into sandbox home

Plan step 2/6: Narrow setup_sandbox_user() to user state only

* refactor: derive sandbox home dirs from registry config

Plan step 3/6: Align registry semantics with the new contract

* refactor: symlink skills into sandbox, enforce shared install prefixes

Replace per-trial skill-tree copies with ln -sfn into a shared /skills (or
task skills_dir) root, drop skill_paths from get_sandbox_home_dirs(), and
add registry + sandbox-setup invariants that keep agent binaries on
/usr/local/* rather than /root-only home paths. Updates task-authoring
and api-reference docs to describe the new lightweight sandbox contract.
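The symlink step above can be sketched as a small command builder; the function name and paths are illustrative, and the real code targets whatever `skills_dir` the task resolves.

```python
import shlex

def symlink_skills_cmd(skills_root: str, sandbox_home: str) -> str:
    # ln -sfn points the sandbox at a shared skills tree instead of
    # copying it per trial; -n replaces an existing symlink rather than
    # descending into it, and shlex.join keeps any spaced paths quoted.
    return shlex.join(["ln", "-sfn", skills_root, f"{sandbox_home}/skills"])
```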

* chore: remove completed sandbox plan doc

---------

Co-authored-by: Xiangyi Li <xiangyi@benchmarkthing.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
* feat: BaseUser abstraction for progressive-disclosure trial loops

Add User as a first-class participant in the trial loop — a Python
callback that produces prompts, sees test results between rounds, and
decides when to stop. This is the infrastructure Josh (GitHub/Microsoft)
needs for SWE-bench Pro progressive disclosure.

New types (user.py):
- BaseUser with setup(instruction, solution) and run(round, instruction, round_result)
- RoundResult dataclass with trajectory, rewards, verifier output
- PassthroughUser (backward-compat default, single round)
- FunctionUser (wraps a plain callback for lightweight use)

Trial changes:
- TrialConfig gains user, max_user_rounds, oracle_access fields
- Trial._run_user_loop(): user.run() → connect → execute → disconnect →
  soft_verify() → build RoundResult → repeat until None or max rounds
- Trial.soft_verify(): runs Harbor verifier WITHOUT hardening so agent
  stays alive between rounds. Final verify() still does full hardening.
- Multi-role + User raises ValueError (deferred to future phase)

16 new tests, 0 regressions on existing 618 tests.

* fix: address self-review — 5 bugs in user abstraction

1. Reorder: disconnect() before soft_verify() — agent process is
   already dead when soft_verify runs, so soft_verify's docstring
   was misleading. Now disconnect → soft_verify is the explicit flow.

2. soft_verify() now runs CLEANUP_CMD (conftest/pth/sitecustomize
   purge) before the verifier. Prevents agent from gaming intermediate
   test results by injecting test-patching files.

3. FunctionUser: use inspect.isawaitable() instead of
   asyncio.iscoroutine() — handles asyncio.Task, Future, and any
   __await__ object, not just coroutines.

4. oracle_access: cat /solution now runs as user="root" — /solution
   is locked (root:700) after install_agent, so the read would
   silently fail without root.

5. try/finally around connect/execute/disconnect in user loop —
   ensures disconnect() always runs even if execute() raises.
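The `isawaitable` change in item 3 is easy to see in isolation. This is a standalone sketch, not benchflow code — it just probes the two predicates against the three awaitable shapes the commit names:

```python
import asyncio
import inspect

async def probe():
    async def coro():
        return 42

    c = coro()                                      # bare coroutine
    t = asyncio.ensure_future(coro())               # asyncio.Task
    f = asyncio.get_running_loop().create_future()  # asyncio.Future
    f.set_result(None)

    # iscoroutine only recognizes the bare coroutine; isawaitable also
    # recognizes Tasks, Futures, and anything else implementing __await__.
    checks = [
        asyncio.iscoroutine(c), asyncio.iscoroutine(t), asyncio.iscoroutine(f),
        inspect.isawaitable(c), inspect.isawaitable(t), inspect.isawaitable(f),
    ]
    await c
    await t
    return checks

results = asyncio.run(probe())
print(results)  # → [True, False, False, True, True, True]
```

So a `FunctionUser` callback that returns a Task or Future would have been silently treated as a plain value under the old check.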

* feat: add user_dogfood.py — progressive disclosure on regex-log

Demonstrates the FunctionUser abstraction:
- Round 0: terse 2-sentence prompt
- Round 1: hints about edge cases on failure
- Round 2: full instruction on continued failure
- Stops early if tests pass

* fix: address Devin review — remove tautological tests, fix model name

- Remove 4 tautological tests (pure dataclass reads) per CLAUDE.md
  convention: TestRoundResult.test_defaults, test_with_data,
  TestTrialConfigUser.test_user_field_defaults_to_none, test_user_field_set
- Fix dogfood model name: gemini-2.5-flash (not expired preview)
- Note: iscoroutine→isawaitable was already fixed in 51d6c61

* fix: address code review — oracle safety, unused import, soft_verify tests

1. Oracle /solution is now moved (not deleted) before agent runs and
   restored before final verify(). Prevents breaking verifiers that
   need /solution to compute rewards.

2. Remove unused asyncio import from user.py.

3. Add 4 soft_verify tests: timeout, crash, success, and CLEANUP_CMD
   execution verification. soft_verify is no longer untested.

* feat: dogfood results — progressive disclosure on regex-log via Daytona

3-round progressive disclosure with Gemini Flash on regex-log:
  Round 0: terse prompt (2 tool calls) → reward=0.0
  Round 1: hint prompt  (3 tool calls) → reward=0.0
  Round 2: full instruction (3 tool calls) → reward=0.0
  Final verify: reward=0.0

Agent scored 0.0 on all rounds — regex-log is a hard task. But the
infrastructure works end-to-end: user loop, soft_verify, fresh ACP
sessions per round, user_rounds.jsonl persistence, final hardened
verify. No errors.

* feat: add opencode agent to registry

OpenCode (opencode-ai) is an open-source TypeScript coding agent with
ACP support. Skills path: $HOME/.opencode/skills (updated from
.opencode/skill per skillsbench #718).

Closes skillsbench #718 for the benchflow side.

* fix: opencode ACP returns 0 tool calls — model format mismatch

Root cause: OpenCode's ACP parseModel() splits modelId on "/" to extract
providerID and modelID. When benchflow sent "gemini-3.1-pro-preview"
(no slash), opencode parsed it as providerID="gemini-3.1-pro-preview"
with modelID="" — an invalid config that silently returned end_turn.

Fix: Add acp_model_format field to AgentConfig. When set to
"provider/model" (opencode), _format_acp_model() infers the models.dev
provider from the bare model name (e.g. "gemini" → "google") and sends
"google/gemini-3.1-pro-preview" to set_model.

Also: opencode requires_env is now empty (inferred from model at
runtime, not hardcoded to ANTHROPIC_API_KEY).
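The inference step can be sketched roughly like this — the function name and the prefix→provider table are illustrative (the real mapping lives in benchflow's agent registry and may differ):

```python
import warnings

# Illustrative prefix → models.dev provider table; assumption, not the real map.
PROVIDER_PREFIXES = {"gemini": "google", "claude": "anthropic", "gpt": "openai"}

def format_acp_model(model, acp_model_format):
    """Qualify a bare model id as provider/model when the agent requires it."""
    if acp_model_format != "provider/model" or "/" in model:
        return model  # already qualified, or the agent accepts bare ids
    for prefix, provider in PROVIDER_PREFIXES.items():
        if model.startswith(prefix):
            return f"{provider}/{model}"
    warnings.warn(f"unknown provider for model {model!r}; sending as-is")
    return model

print(format_acp_model("gemini-3.1-pro-preview", "provider/model"))
# → google/gemini-3.1-pro-preview
```

Already-qualified ids pass through untouched, so agents that send `provider/model` themselves are unaffected.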

* feat: executed notebook — SWE-bench Pro progressive disclosure analysis

OpenCode + gemini-3.1-pro-preview on qutebrowser SWE-bench Pro:

Baseline (full prompt, 1 round): 40 tools, 736s, reward=0.0
Progressive (3 rounds):          185 tools, 1154s, reward=0.0
  Round 0 (terse):     86 tools (81 bash + 5 edit)
  Round 1 (hints):     76 tools (66 bash + 10 edit)
  Round 2 (full):      23 tools (16 bash + 7 edit)

Both scored 0.0 due to a verifier infrastructure bug (rootdir=/tests
instead of /app, so pytest couldn't find its config). The agent's fixes
were likely correct — it demonstrated passing tests in its own environment.

Key findings:
- Progressive disclosure changed agent behavior (86→76→23 tools)
- _reset_cache implemented only after Round 1 hint
- OpenCode handled 185 tool calls without token limits
- Verifier rootdir bug needs investigation

* fix: replace hand-curated pytest plugin whitelist with auto-discovery

The old mechanism (4 dicts + 4 functions + 1 regex) required manual
code changes for every new benchmark with an undeclared pytest plugin.
SWE-bench Pro tasks failed because pytest-benchmark wasn't whitelisted.

New mechanism: one container-side script + one async function. At
hardening time, enumerate all pytest11 entry points from root-owned
system packages. Only root-owned dist-info directories are trusted —
editable installs from agent-writable /testbed are excluded.

PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 stays in place. Security preserved.
task.toml pytest_plugins kept as fallback.

Deleted: _PYTEST_PLUGIN_ALIASES, _PYTEST_OPTION_PLUGINS,
_PYTEST_INSTALLED_PLUGINS, _PIP_INSTALL_RE, _normalize_pytest_plugin,
_plugins_from_verifier_script, _declared_pytest_plugins,
_pytest_plugin_flags, tomllib import.

Added: _DISCOVER_PYTEST_PLUGINS_SCRIPT, _discover_pytest_plugin_flags.

* fix: handle Python 3.9 importlib.metadata API in plugin discovery

Python 3.9's entry_points() doesn't accept keyword arguments — returns
a dict instead. Fall back to entry_points().get('pytest11', []) when
the keyword style raises TypeError.
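The version-compatible discovery amounts to a try/except over the two `importlib.metadata` APIs — a minimal sketch (function name hypothetical), with the `-p` flag construction the hardening step needs:

```python
from importlib import metadata

def pytest11_entry_points():
    """List pytest plugin entry points across importlib.metadata API versions.

    Python >= 3.10 accepts entry_points(group=...); 3.9's entry_points()
    takes no keyword arguments and returns a dict keyed by group name.
    """
    try:
        eps = metadata.entry_points(group="pytest11")
    except TypeError:  # Python 3.9 dict-style API
        eps = metadata.entry_points().get("pytest11", [])
    return list(eps)

# The hardening step turns discovered names into explicit -p flags so that
# PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 can stay set:
flags = [arg for ep in pytest11_entry_points() for arg in ("-p", ep.name)]
```

The except branch only ever runs on 3.9, where the dict API is the only one available.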

* fix: simplify plugin discovery — skip ownership check

The uid==0 check was failing on Python 3.9 containers where
ep.dist._path doesn't exist. Simplified to just enumerate all
pytest11 entry points — sandbox_user prevents agent pip installs,
so all discovered plugins are image-authored.

* feat: updated notebook with fixed-verifier results

Both progressive + baseline rerun with working verifier (15 plugins
discovered). Results with honest scoring:

Progressive (3 rounds): 284 tools, 970s, reward=0.0
  Round 0: 94 tools, Round 1: 92 tools, Round 2: 98 tools
Baseline (1 round):     73 tools, 611s, reward=0.0

Both failed due to agent code errors (circular imports), not
verifier infrastructure. Progressive used 4x more compute for
same outcome on this task.

* fix: preserve trusted PYTHONPATH entries during verifier hardening

VERIFIER_ENV cleared PYTHONPATH="" which broke SWE-bench Pro tasks
where the Dockerfile sets PYTHONPATH=/app/lib:/app for project imports.

New: _trusted_verifier_pythonpath() filters PYTHONPATH using the same
root-owned validation as PATH, but does NOT block the workspace —
/app is already importable via CWD/pytest sys.path insertion, so
clearing it only breaks imports without security benefit. /tmp,
/var/tmp, /home/agent are still blocked.

Re-pinned after task-env merge like PATH.
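The filtering policy has roughly this shape — a sketch mirroring `_trusted_verifier_pythonpath`, with the trust check injectable so the policy can be exercised without root-owned fixtures (the default root-ownership predicate is an assumption about the real check):

```python
import os

BLOCKED_PREFIXES = ("/tmp", "/var/tmp", "/home/agent")  # agent-writable roots

def trusted_pythonpath(raw, is_trusted=None):
    """Keep only PYTHONPATH entries that are safe to retain at verify time."""
    if is_trusted is None:
        def is_trusted(p):  # root-owned directories only (illustrative)
            return os.path.isdir(p) and os.stat(p).st_uid == 0
    kept = []
    for entry in raw.split(":"):
        if not entry:
            continue
        if any(entry == b or entry.startswith(b + "/") for b in BLOCKED_PREFIXES):
            continue  # agent-writable locations are always dropped
        if is_trusted(entry):
            kept.append(entry)
    return ":".join(kept)

# With a stub predicate: /app/lib and /app survive, /tmp/evil is dropped.
print(trusted_pythonpath("/app/lib:/tmp/evil:/app", is_trusted=lambda p: True))
# → /app/lib:/app
```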

* fix: address review comments on BaseUser PR

- soft_verify: chmod 777 /logs/verifier so non-root verifier can write
- soft_verify: restore /solution before verify, re-hide after (oracle access)
- validate empty roles (!=1) and multi-scene configs in user loop
- remove tautological test_setup_is_noop
- remove opencode BENCHFLOW_PROVIDER_API_KEY→ANTHROPIC_API_KEY mapping
  (wrong for non-Anthropic models; native keys inherited via auto_inherit_env)
- warn on unknown provider fallback in _format_acp_model
- remove --rootdir=/tests from VERIFIER_ENV (cherry-pick from PR #187)
- fix printenv PYTHONPATH crash when unset
- fix stale plugin discovery docstring

* feat: add SWE-bench Pro oracle validation + baseline experiment script

Runs oracle (gold solution) on all 4 testable tasks to verify the
--rootdir fix, then runs a single-round agent baseline for comparison
with progressive disclosure. Results to CSV.

* fix: address Codex review on PR #184 — oracle safety + warnings

Three Codex review findings on the BaseUser abstraction:

1. oracle_access=True with user=None silently leaves /solution exposed to
   the agent for the entire trial. Add a logger.warning at setup time so
   misconfigurations surface immediately.

2. Oracle restore (mv /solution_oracle_backup /solution) was outside any
   finally block. If _run_user_loop() raised, /solution was never restored.
   Move the user/scene execution into try/finally so the restore always
   runs before the final verify().

3. Oracle read used a wildcard fallback (cat /solution/* || true) that
   could leak unintended files (binaries, credentials). Narrow to
   solve.sh — the canonical SWE-bench Pro oracle path.

Bugs Codex flagged that were FALSE POSITIVES (verified against code):
  - "session counter reset" — disconnect() already resets both counters
  - "None instruction" — _resolve_prompts returns [instruction] not [None]

Tests still pass: 15 user + 58 sandbox = 73 total.

* feat: per-task verifier hardening opt-outs + restore --rootdir=/app

Two related changes addressing SWE-bench Pro oracle compatibility:

1) Restore --rootdir=/app in PYTEST_ADDOPTS

   Removing --rootdir entirely (PR #187) made pytest fall back to /dev as
   rootdir (from -c /dev/null), producing test node IDs like ../dev/::test_foo
   instead of <repo>/<path>::test_foo. The verifier expects full-path IDs and
   reported 0 passing tests on openlibrary even though all 18 tests passed.

   --rootdir=/app anchors test IDs to the canonical Harbor repo root while
   -c /dev/null still blocks pyproject.toml/pytest.ini discovery and
   --confcutdir=/tests still blocks conftest walk-up beyond /tests.

2) Per-task [verifier.hardening] opt-outs in task.toml

   The cleanup that deletes agent-injected conftest.py also deletes
   legitimate repo conftest.py files. qutebrowser ships conftest.py that
   sets up import order to break a real circular dependency between
   qutebrowser.browser.inspector and qutebrowser.misc.miscwidgets — without
   them, pytest collection fails on a type annotation in miscwidgets.py:419.

   Tasks now declare opt-outs in task.toml:

       [verifier.hardening]
       cleanup_conftests = false  # qutebrowser

   Defaults remain secure (all True). New helpers in _sandbox.py:

   - HARDENING_DEFAULTS: dict of feature flags
   - _read_hardening_config(task_dir): parse task.toml [verifier.hardening]
   - _build_cleanup_cmd(hardening): build cleanup honoring opt-outs

   CLEANUP_CMD constant kept as backward-compat alias.

   Both harden_before_verify() and Trial.soft_verify() now read per-task
   hardening config before running cleanup.

Validation on SWE-bench Pro oracle (Daytona):

  Before: 2/4 (ansible, flipt) — openlibrary failed test ID format,
                                  qutebrowser failed conftest deletion
  After:  5/5 (ansible, flipt, openlibrary, qutebrowser, navidrome)

Tests: 80 passing (15 user + 65 sandbox including 7 new opt-out tests).

* docs: add progressive-disclosure guide + SWE-bench Pro demo notebook

For Josh's SWE-bench Pro use case (and Harbor #1316 parity in the
no-second-LLM case):

- docs/progressive-disclosure.md: dedicated guide for the BaseUser
  abstraction. Covers the API, oracle access, [verifier.hardening]
  opt-outs, and when to choose BaseUser vs multi-role Scene.

- docs/use-cases.md: brief mention in §1 (Interactive User Simulation)
  pointing to progressive-disclosure.md for the lighter-weight
  callback-based pattern.

- examples/swebench_pro_progressive_disclosure.ipynb: clean rewrite of
  the existing notebook. Shows the API, oracle 5/5, baseline 4 tasks,
  per-task hardening opt-out example, and a placeholder cell that auto-
  loads the latest progressive-disclosure run from
  /tmp/swebench-pro-jobs/progressive when one exists. Executes top-to-
  bottom against the current oracle/baseline CSV.

- examples/swebench_pro_user_dogfood.py: ready-to-run script for
  progressive disclosure on any of the 5 working SWE-bench Pro tasks.
  Three-round user: terse → failing tests + half spec → full spec.

- experiments/swebench-pro-results.csv: oracle + baseline results from
  2026-04-24 Daytona run. qutebrowser entry is pre-fix (verified post-
  fix separately, noted in notebook).

* docs: add progressive-disclosure.md to CLAUDE.md docs index
…gs retry (#196)

* Bump DaytonaPtyProcess readline timeout 300s→900s

Long-running TTS/audio tasks (e.g. pg-essay-to-audiobook) generate
extended quiet periods on stdout while ffmpeg/whisper run. The 300s
PTY readline timeout fires before the per-task agent timeout (900s),
prematurely killing healthy runs.

Align readline timeout with the standard agent timeout so the PTY
only fails when the inner process is actually wedged.

* Daytona SDK: retry SessionCommandLogsResponse ValidationError

The Daytona server occasionally returns an empty string instead of a
JSON object when fetching session command logs, which causes pydantic
to raise ValidationError inside AsyncProcess.get_session_command_logs.
We've reproduced this on SDK 0.168.x and 0.169.x; the surface is most
visible in skillsbench tasks that ask the verifier for command output
(e.g. latex-formula-extraction).

Patch the SDK method at runtime with a small bounded retry. After
four malformed payloads we fall back to an empty (but valid) response
so callers can still inspect exit_code via get_session_command —
silently missing logs are preferable to failing a whole trial as
ERROR over an upstream marshalling glitch.

Patch is applied lazily from _create_environment so we never touch
the SDK on Docker-only runs.
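The shape of the runtime patch looks roughly like this — names and structure are illustrative, not benchflow's actual `_daytona_patches.py` code: retry while the predicate matches, then degrade to an empty-but-valid fallback instead of erroring the trial:

```python
import asyncio
import functools

def with_bounded_retry(fn, should_retry, attempts=2, fallback=None):
    """Wrap an async SDK method with a small bounded retry (a sketch)."""
    @functools.wraps(fn)
    async def wrapper(*args, **kwargs):
        last_exc = None
        for _ in range(attempts):
            try:
                return await fn(*args, **kwargs)
            except Exception as exc:
                if not should_retry(exc):
                    raise  # unrelated failures propagate untouched
                last_exc = exc
        if fallback is not None:
            return fallback()  # empty-but-valid response for callers
        raise last_exc
    return wrapper

async def _always_malformed():
    raise ValueError("1 validation error for SessionCommandLogsResponse")

patched = with_bounded_retry(
    _always_malformed,
    should_retry=lambda exc: "SessionCommandLogsResponse" in str(exc),
    fallback=lambda: "EMPTY_LOGS",
)
result = asyncio.run(patched())
print(result)  # → EMPTY_LOGS
```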

* Daytona retry: catch DaytonaError wrapping the malformed-logs ValidationError

The first version of this patch only matched on pydantic ValidationError,
but AsyncProcess.get_session_command_logs is decorated by intercept_errors
at class-definition time — every inner exception is converted to
DaytonaError before our patched bound method ever sees it. Verified
against latex-formula-extraction on Daytona: the patch wrapper was being
called, but the except-clause never matched, so the run still failed.

Match on DaytonaError whose message contains 'SessionCommandLogsResponse'
in addition to bare ValidationError, and drop the wrapper to 2 attempts
(harbor already wraps the call in tenacity x3 — extra retries here are
wasted on a deterministic malformed payload). Empty-fallback unchanged.
* fix: env-file path mismatch in DinD compose mode

Devin caught a real bug introduced by PR #193 (DinD compose ACP):
src/benchflow/process.py:325 sets remote_env_path = "/tmp/benchflow_env_$$.env"
expecting the remote shell to expand $$ to its PID. But shlex.join() at
line 329 single-quotes the --env-file argument, so docker compose receives
the literal string "/tmp/benchflow_env_$$.env" while the cat heredoc that
writes the file (line 339, raw f-string) does expand $$. The file is
written to /tmp/benchflow_env_<pid>.env and read from /tmp/benchflow_env_$$.env
— silent mismatch, env vars (incl. API keys) silently dropped in DinD
compose tasks.

Fix: use uuid.uuid4().hex[:16] for the unique suffix instead of relying on
shell-side $$ expansion. The path is then a literal that survives quoting.
Apply the same fix to the direct (non-DinD) Daytona branch even though it
was working — uniformity makes the path robust against future quoting
changes.

Also fix a pre-existing SIM103 lint error in _daytona_patches.py that
ruff caught while validating the test changes.

Tests: tests/test_process.py +2 regression tests pinning that no remote
command contains a literal "$$" — would catch this exact regression.
8/8 process tests pass; ruff clean.
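The quoting mismatch reproduces with nothing but the standard library — `shlex.join` single-quotes any argument containing `$`, so the remote shell never expands `$$` inside the joined command line:

```python
import shlex
import uuid

# The `$$` in the argument survives as a literal once shlex.join quotes it:
cmd = shlex.join(["docker", "compose", "--env-file", "/tmp/benchflow_env_$$.env"])
print(cmd)  # → docker compose --env-file '/tmp/benchflow_env_$$.env'

# A literal unique suffix is immune to any quoting decision:
remote_env_path = f"/tmp/benchflow_env_{uuid.uuid4().hex[:16]}.env"
safe_cmd = shlex.join(["docker", "compose", "--env-file", remote_env_path])
assert "$$" not in safe_cmd
```

Meanwhile an unquoted heredoc on the writing side does expand `$$` — hence the two sides disagreeing on the path.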

* test: reference PR #193 / #198 in regression test docstring

Devin caught: CLAUDE.md mandates regression tests name the commit/PR
they guard. Updated TestDaytonaProcessEnvFilePath docstring to cite
PR #198 (the fix) and PR #193 / commit cdccac7 (the regression).
# Conflicts:
#	src/benchflow/_agent_env.py
#	src/benchflow/cli/eval.py
#	tests/test_oracle_chokepoint.py

@devin-ai-integration devin-ai-integration Bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 8 additional findings.


@xdotli xdotli merged commit 23b4de4 into main Apr 25, 2026
2 of 3 checks passed
@xdotli xdotli deleted the dev-0.3 branch April 25, 2026 11:04
